A note on the application of the Oakes' identity to 
obtain the observed information matrix of hidden 



Markov models 

Francesco Bartolucci, Alessio Farcomeni and Fulvia Pennoni 

January 31, 2012 

Abstract 

We derive the observed information matrix of hidden Markov models by the ap- 
plication of the Oakes (1999) ! s identity. The method only requires the first derivative 
of the forward-backward recursions of Baum and Welch (1970), instead of the sec- 
ond derivative of the forward recursion, which is required within the approach of 
Lystig and Hughes (2002). The method is illustrated by an example based on the 
analysis of a longitudinal dataset which is well known in sociology. 

Keywords: Expectation-Maximization algorithm, Local identifiability, Latent Markov 
model, Longitudinal data, Standard Errors 

1 Introduction 

Hidden Markov (HM) models have been developed early in the literature on stochastic 
processes as extensions for measurement errors of the standard Markov chain model; for 
one of the oldest relevant contributions about these models, see Baum and Petrie (1966). 



HM models have received much attention in the time-series analysis literature, due to 
their wide applicability and easy interpretation (for an up-to-date review see Zucchini 
and MacDonald, 2009). These models are also finding an increasing popularity for the 
analysis of longitudinal data (see Bartolucci et al., 2010). 

The main tool for maximum likelihood (ML) estimation of the parameters of an 
HM model is the Expectation-Maximization (EM) algorithm, which is based on certain 
forward-backward recursions. This algorithm and these recursions were developed by 
Baum and colleagues in a series of papers specifically for HM models (Baum and Petrie, 
1966; Baum and Egon, 1967; Baum et al., 1970). Then, the EM algorithm was put in a 
more general context in the widely cited paper of Dempster et al. (1977). 

A drawback of the afore mentioned algorithm is that it does not provide, as a by- 
result, the standard errors for the parameter estimates. This is because it uses neither 
the observed nor the expected information matrix, which are suitable transformations 
of the second derivative matrix of the model log-likelihood. From the inverse of these 
matrices, we obtain standard errors for the parameter estimates. Then, from the output 
of the EM algorithm, we have not an obvious method for assessing the precision of these 
maximum likelihood estimates. The information matrix is also important to check the 
local identifiability of the model through its rank; see McHugh (1956) and Goodman 
(1974) among others. 

Computing the information matrix (observed or expected) of a latent variable model, 
as an HM model, is considered a difficult task. Several methods have been proposed to 
overcome this difficulty; for a review see Lystig and Hughes (2002) and McLachlan and 
Krishnan (2008). One of the more interesting solutions was proposed by Louis (1982). 
This solution is based on the missing information principle as defined by Orchard and 
Woodbury (1972). According to this principle, the observed information matrix can be 
expressed as the difference between two matrices corresponding to the complete informa- 
tion, which we would be able to compute if we knew the latent states, and the missing 
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information due to the unobserved variables. However, this correction term is difficult in 
general to compute; see Oakes (1999) for further comments and Turner et al. (1998) for 
related techniques. 

Oakes (1999) presented an alternative approach, with respect to that of Louis (1982), 
to compute the observed information matrix of a latent variable model. In particular, he 
derived an explicit formula for the second derivative matrix of the model log-likelihood 
which involves the first derivative of the conditional expectation of the score of the com- 
plete data log-likelihood, given observed data. 

Specifically for HM models, Lystig and Hughes (2002) proposed a method for exactly 
computing the observed information matrix based on the second derivative of the forward 
recursion of Baum et al. (1970) which is used to compute the model log-likelihood; for 
a similar method see Bartolucci (2006). The method of Lystig and Hughes (2002) has 
become rather popular in the HM literature. Among the methods related to the EM 
algorithm, we also mention that proposed by Bartolucci and Farcomeni (2009) which is 
very simple to implement and requires a small extra code over that required for the ML 
estimation. However, since it is based on the numerical derivative of the score, the ob- 
tained information matrix may be considered an approximation of the true one. Also note 
that, in order to obtain standard errors for the parameter estimates, we can alternatively 
use a parametric bootstrap method (Efron and Tibshirani, 1993), as described in Zuc- 
chini and MacDonald (2009). Even if the standard errors obtained in this way may be 
more reliable with respect to those based on the information matrix, the method may be 
computationally costly and, in any case, does not allow us to check for local identifiability 
in an obvious way. 

In this paper, we show how to apply the Oakes (1999) identity to obtain the observed 
information matrix of an HM model. As we will show, the proposed method only requires 
the first derivative of the forward-backward recursions of Baum et al. (1970), whereas 
the method of Lystig and Hughes (2002) requires the second derivative of the forward 
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recursion. On the other hand, the proposed method is superior to that of Bartolucci and 
Farcomeni (2009) since it allows us to exactly compute the observed information matrix. 
To the best of our knowledge, an implementation of the Oakes (1999) 's identity for HM 
models, as the one we propose here, is not available in the literature. 

The proposed approach is illustrated through an application based on a well-known 
longitudinal dataset. For the specific HM model used in this application, we make avail- 
able some R functions 1 to compute the information matrix and then obtaining the standard 
errors for the parameter estimates. 

In the following, we first briefly review the EM algorithm and the Oakes (1999) 's 
identity in their general versions. In Section 3 we propose an implementation of this 
identity for HM models on the basis of a suitable reparametrization. Then, in Section 4 
we describe the application of the proposed method in connection with the analysis of the 
dataset mentioned above. 

2 Preliminaries 

We give in this section the necessary background about the EM algorithm and the Oakes 
(1999) 's identity in general; then we recall some important features about HM models. 

2.1 EM algorithm and observed information matrix 

The EM algorithm (Dempster et al., 1977) is an iterative algorithm for finding the ML 
estimator of models with missing variables and has a special role in the literature on latent 
variable models. 

With reference to an observed sample, let £(0) denote the log-likelihood of the latent 
variable model of interest, where 6 is the vector of parameters. As it is well known, the 
EM algorithm is based on the complete data log-likelihood, denoted as £*(0), which is the 
Hhrough a website to be indicated later 
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log-likelihood that we could compute if we knew the value of the latent variables for each 
every sample unit. In particular, to maximize £(0), the algorithm alternates the following 
steps until convergence: 

• E-step: compute the conditional expected value of the complete data log-likelihood 
given the current estimate of 0, denoted by 0, and the observed data. This expected 
value is denoted by Q(0\6); 

• M-step: maximize Q{0\0) with respect to 0. 

We now consider the score and the observed information matrix corresponding to the 
model log-likelihood £(0). These are defined, respectively, as 

It may be simply proved that 

dQ{0\0) 



s(0) 



do 



0=0 

Consequently, the Oakes (1999) 's identity states that: 



J{0) 



d 2 Q{0\0) 



dOdO' 



d 2 Q{0\0) 



o=o d ^ d6 ' 



(1) 



0=0 J 

This identity then involves two components. The first component is the second derivative 
of the conditional expected value of the complete-data log-likelihood given the observed 
data. This component is simple to obtain from the EM algorithm. The second component 
involved in (1) is the first derivative of the score for the same expected log-likelihood with 
respect to the current value of the parameters. 

2.2 Hidden Markov models 

Consider now a sequence of T response variables , . . . , Y ^ , which are collected in the 
vector Y. These response variables may be continuous or categorical and we may even 



observe a vector of multivariate outcomes at each t. In the following, we briefly review 
the assumptions of an HM model for these data and then how to apply the EM algorithm 
for ML estimation of the resulting model. 

2.2.1 Assumptions 

An HM model relies on the following basic assumptions: 

• the response variables , . . . , are conditionally independent given a sequence 
of unobserved variables . . . , giving rise to a latent process vector denoted 
by U; 

• every response variable Y^\ t = 1, . . . ,T, depends on the latent process U only 
through [/(*); 

• the latent process U follows a Markov chain with k states labelled from 1 to k. 
We consider in particular HM models in which: 

• the conditional distribution of given is time-homogenous; 

• the latent Markov chain is of first-order and time-homogeneous. 

Parameters of the model are then the initial probabilities of the latent process, denoted 
A« = fuw( u ) with u = l,...,k and the transition probabilities 7r u |„ = fu(t)\u(t-^)( u \u) 
with t = 2, . . . , T and u,u = 1, . . . , k. The initial probabilities are collected in the k- 
dimensional column vector A and the transition probabilities are collected in the k x k 
matrix II, with each row denoted tz' Qi where ir u = (ni\n, . . . , ^k\u)' ■ 

In the above expressions, f V (t) (u) denotes the probability mass function of the distri- 
bution of whereas fu(.*)\u(- t - 1 ')( u \^') denotes the probability mass function of given 
C/(* -1 ). A similar convention will be used to denote density functions. 

Furthermore, when the response variables are categorical with a reduced number of 
categories (labelled from 1 to c), we introduce the additional notation (f> y \ u = fYW\u( t )(v\ u ) 
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with u — 1, . . . , k and y — 0, . . . , c — 1. The probabilities are collected in the c x k matrix 
$ which is made of the column vectors <f> u , with <f> = (<f>i\ u , . . . , C | U )'. 

2.3 Application of the EM algorithm 

It is well known that the above model may be estimated by an EM algorithm formulated 
as in Baum et al. (1970); see also Bartolucci et al. (2010) and Zucchini and MacDonald 
(2009). 

Suppose that we observe n > 1 independent realizations of Y, denoted by y 1 , . . . y n , 
with every y { having elements yf\ t — 1, . . . , T. Note that, in the case of time-series data, 
we can only observe a single realization of Y and then n — 1; in this case, T is typically 
large. On the other hand, in the case of longitudinal data, n is often large as compared 
to T. Our results apply invariably and the model log-likelihood may be expressed as 

tin) = J2 l °sfY(yi) = J2 n y l °sfY(y), 

i y 

where r) is a vector containing all the parameters in 7r, and II, fy(y) is the prob- 
ability mass function of Y seen as a function of rj. This function can be computed by 
a forward recursion which is described in Appendix 1. Moreover, n y is frequency of the 
response configuration y = (y^\ . . . ^y^)' and the sum J2 y is extended to all response 
configurations observed at least once. 

We now specialize the EM algorithm for the case of categorical outcomes mentioned at 
the end of the previous section. Let a^, with t — 1, . . . , T, u — 1, . . . , k, y — 0, . . . , c — 1, 
denote the frequency of = u and = y, let b$ , with t — 1, . . . , T, u — 1, . . . , k, 
denote the frequency of = u, and let c^l, with t — 2, . . . , T, u, u — 1, . . . , k, denote 
the joint frequency of the latent states U^' 1 ^ = u and = u. Every E-step of the EM 
algorithm consists of computing the conditional expected value of these frequencies given 
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the observed data and the current value of the parameter vector denoted by 17, that is 

«8 = Efu^Y(u\y^I(y ( : t) = y) = J2nyf u ^Y(u\y)I(y (t) =y), (2) 
i y 

i>u ] = Y,fuw\Y(u\y i ) = Y, n yfu( t )\Y(u\y), (3) 
i y 

cf u = ^fu(t-i),uw\Y(y\yi) = J2 n yfu( t -v,uM\Y( u \y)> ( 4 ) 

i y 
where l(-) is the indicator function equal to 1 if its argument is true. These expected 
values involve posterior probabilities that may be computed by recursions illustrated in 
Appendix 1. 

Then, the M-step consists of maximizing the conditional expected value, given the 
observed data and f), of the complete data log-likelihood, which may be decomposed as 

Q(r,\f,) = Qi(*|»j) + Q 2 (7r|77) + Q 3 (n|77), 

with 

Qi(*|t)) = EEE a S lo g^. 

y t u 
Q 2 (7T\f,) = E&WlogA,,, 

u 

Qs(n|fj) = 5ZSS^ log7ru|a " 

t>l u u 

Explicit expressions are available to maximize separately each of these expressions. In 
fact, at every M-step <p y \ u is set proportional to Y^t^uyi ^« to and n u \ u to St>i c^ u ; 
see Bartolucci et al. (2010) for a more detailed description. 

3 Observed information matrix for HM models 

First of all, we consider a reparametrization of the model such that the new parameter vec- 
tor, denoted by 6, is variation independent and is contained in TV for a suitable s. Then, 
we show how to implement the Oakes (1999) 's identity by exploiting this reparametriza- 
tion. 
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3.1 Reparametrization of the model 

The conditional response probabilities are reparametrized through c — 1 logits referred to 
the first category, that is 

ot y \u = log it = 1, . . . , k, y = 1, . . . , c - 1, 

01|« 

which are included in the (c— l)-dimensional column vectors a u ; moreover, by a we denote 
the vector made of the subvectors oc±, . . . , et k . It is worth noting that the choice of the 
baseline category is irrelevant for the inference and shall be guided only by interpretability 
reasons. The initial probabilities are transformed similarly by the logits 

/? u = log^±i, u = l,...,k- 1, 
M 

which are collected in the (k — 1) -dimensional column vector (3. Finally, the transition 
probabilities are parametrized through logits referred to the diagonal element, that is 

7 Su = log^, u,u= l,...,k, u^u, 

^u\u 

which are collected in the (k — l)-dimensional vectors 7 S for u — 1, . . . , k] we also denote 
by 7 the overall vectors made of the subvectors j 1 , . . . , j k . 

It is convenient to express the above vectors of logits in matrix notation. In particular, 
we can easily show that a. may be obtained by stacking the vectors 

ol u = A\ogcj) u , u=l,...,k, 

where A = ( — l c _i I c -i ), with 1^ denoting a column vector of h ones and an identity 
matrix of the same dimension. The inverse transformation is 

- /0' c _i\ 

</>„ = [lfcexp(Aa u )] i exp(Aa u ), A=\ j, (5) 
where Oh is column vector of h zeros. Similarly, we have that 

f3 = B\og\, 
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with B = ( — lfc_i Jfc-i ); the inverse transformation of the last expression is defined as 
in (5) on the basis of the matrix B defined in a similar way as A. Finally, the vector 8 
is made of the subvectors d a , u = 1, . . . , k, with 

6 a = C u log7T s , 

where 



C 



I u— 1 lit— 1 Ou— l,k— u 

= I 

Ok u,u— 1 lfc— it I k—u 



with O/jj denoting an /i x j matrix of zeros. The inverse transformation, to obtain ir u 
from is as in (5), with A substituted by 

/ lu-l U -l,k-u^ 
C u = u -i 0' k -u 

V Ok u,u— 1 Ik—u I 

The new vector of parameters 6 is obtained by stacking the single parameters vectors, 
that is = (a',/3',7')'. Obviously, provided that all probabilities ir y \ u , X u , and tt u \ u 
are strictly positive, 6 e 7?. s , with s = (c — + — 1 + — 1), and is a one-to- 
one transformation of the original parameter vector 77, which instead belongs to a more 
complex space. 

3.2 Computing the observed information matrix 

First of all, adopting the above reparametrization, the expected value of the complete 
data log-likelihood may be expressed as 

Q{B\9) = Qx(a|0) + Q 2 (J3\0) + Qsh\6), 
where, using the matrix notation, we have 

Qi(a\0) = J2 a 'u l °S^u, 

u 

Q 2 ((3\0) = (6 (1) )'logA, 
QshW) = H^logTTs, 

u 
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with a u denoting a column vector with elements J2t ^uy-> y — 0, ■ ■ ■ ,c — 1, denoting a 
column vector with elements u = 1, . . . , k, and c u denoting a vector with elements 
J2t>i c 1m) u = 1, • • • , fc. Consequently, by applying standard rules about log-linear models, 
we have the following score vectors for the complete-data log-likelihood: 

8Q 1 (a\6) 



da 



= £A'(o»-*i"0J, (6) 



*hm = &{ i m -nX), (7, 



8/3 

- E^-^o)- ( 8 ) 



where = diag(</> u ) — <fi u <fi' u , Q\ and fi^- are defined in a similar way, and = 
J2t>ibu ^ ■ Similar, we have the second derivative matrices: 

d 2 Q 2 ((3\e) 



8(38(3 
d 2 Q 3 h\0) 



-nB fl x B, 



d-yd-y 

It is straightforward to see that the second derivative in (1) is a block-diagonal matrix, 
with blocks corresponding to above three derivatives, that is 

8 2 Q(6\6) f<9 2 Qi(a|0) 8 2 Q 2 ((3\6) d 2 Q 3 (j\6)\ 

diag 



dode' ' 6 V dadcx ' 8/38/3 ' djdj j ' 

Moreover, in order to compute the second component in (1) we need the first derivatives 
of the expected frequencies in (6), (7), and (8) with respect to 6. More precisely, we have 
8 2 Q{6\6) (PQ^alO) 8 2 Q 2 {(3\6) d 2 Q 3 ( 7 \6)\ 



where 



dodo' " I deda' ' 868(3' ' oedj' 



W) _ (9) 



868a.' V V dd 86 

B, (10) 



a 2 g 2 (/3|0) d(b {1) y 



868(3' 86 
dQMd) ^(dc', db^ , 
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How to compute the first derivatives of the above expected values with respect to is 
shown in Appendix 2. 

Once the observed information at the ML estimate of 6 has been obtained through (1) 
exploiting the above results, on the basis of this matrix we can obtain the standard errors 
and check identifiability in the usual way. In particular, the standard errors are obtained 
by computing the square root of the elements in the main diagonal of J(O)^ 1 . Then, local 
identifiability is checked through the rank of J(0); nevertheless, that this matrix is of full 
rank is required in order to compute its inverse. 

Note that the standard errors obtained as above are referred to the ML estimate of 
the parameter vector 6. However, we can simply express the standard errors for the cor- 
responding estimate of the initial parameter vector r) by the delta method. In particular, 
we first compute 



V=V/ 



V=Vj 



to estimate the variance-covariance matrix of f] and then we obtain the corresponding 
standard errors as the square root of the elements in the main diagonal of this matrix. In 
particular, the derivative matrix of with respect of rj may be simply constructed as a 
block diagonal matrix with blocks corresponding to the derivative of ol u with respect to 
every (f>' u , u — 1, . . . , k, to the derivative of (3 with respect to it', and to the derivative of 
7r s with respect to u — 1, . . . , k. For instance, we have 

dcf> 

and in similar way we can compute the other other blocks. 

Finally, it is important to consider that the method described above may be simply 
adapted to more sophisticated HM models in which, for instance, the transition proba- 
bilities are time-heterogeneous, the distribution of the response variables given the latent 
state is assumed to belong to a certain parametric family, and/or covariates are included 
in the model; see Bartolucci et al. (2010). However, we prefer to focus on a specific, but 
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important, HM model in order to make the description of the proposed methods simpler 
to understand. 

4 Example 

In order to illustrate the proposed approach, we analyze a well-known dataset based on 5 
annual waves of the National Youth Survey (Elliot et al., 1989). The dataset concerns 237 
individuals who were aged 13 years in 1976. The use of marijuana was measured by an 
ordinal response variable for each wave, having the following three categories: "never in 
the past year" (coded as 1); "no more than once in a month in the past year" (coded as 2); 
"once a month in the past year" (coded as 3). Such data have been also used for empirical 
illustrations by Lang et al. (1999), Vermunt and Hagenaars (2004), and Bartolucci (2006). 

With k = 2 we obtain the estimates of the conditional response probabilities displayed 
in Table 1. The table also reports the standard errors obtained with the proposed method 
and those obtained using the parametric bootstrap with a number of sample repetitions 
equal to 1000. Moreover, in Tables 2 and 3 we show the estimates of the initial probabilities 
and of the transition probabilities respectively, together with the corresponding standard 
errors. 



y 


estimate 


s.e. 


boot 


.s.e. 


u = 1 u = 2 


u = 1 


u = 2 


u = 1 


u = 2 


l 


0.9552 0.0791 


0.0137 


0.0338 


0.0096 


0.0315 


2 


0.0437 0.4623 


0.0131 


0.0339 


0.0090 


0.0338 


3 


0.0011 0.4586 


0.0024 


0.0398 


0.0024 


0.0358 



Table 1: Estimates of the parameters <p y \ u and corresponding standard errors obtained by 
the proposed method (s.e.) and a parametric bootstrap method based on 1,000 samples 
(boot. s.e.). 



13 



u 


est. 


s.e. 


boot. s.e. 


1 


0.9466 


0.0178 


0.0166 


2 


0.0534 


0.0178 


0.0166 



Table 2: Estimates of the parameters X u and corresponding standard errors obtained by 
the proposed (s.e.) method and a parametric bootstrap method based on 1,000 samples 
(boot. s.e.). 





est. 


se 




se.boot. 


u 


u = 1 u — 2 


u= 1 


u = 2 


u = 1 u = 2 


1 


0.8774 0.1226 


0.0157 


0.0157 


0.0140 0.0140 


2 


0.0319 0.9681 


0.0316 


0.0316 


0.0268 0.0268 



Table 3: Estimates of the parameters tt u \ u and corresponding standard errors obtained 
by the proposed method (s.e.) and a parametric bootrap method based on 1,000 samples 
(boot. s.e.). 

For this application, through the proposed recursion we easily obtain the standard er- 
rors for the parameter estimates. Moreover, as shown in Tables 1, 2, and 3, these standard 
errors are always very close to the corresponding parametric bootstrap standard errors. 
This confirms the validity of the proposed method to compute the observed information 
matrix. 

We also estimated the HM model k = 3 classes, however the information matrix J(0) 
is singular because one of the transition probabilities becomes equal to 0, so that we 
cannot state that this model is locally identifiable. 
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Appendix 1: Efficient implementation of recursions 



Manifest distribution of the response variables 

In order to efficiently compute the manifest probability let q®(y) denote the 

column vector with elements fu(t),YW,...,Y(t) ( u , ■ ■ ■ , V®), for w = 1, . . . , k. Then, we 



have 



diag(m 2/( i))A, t = 1, 



q(t){y) ~ { S^K^nV-^Cy), t = 2,...,T, (12) 
where m y is a A; dimensional column vector containing the probabilities (f> y \ u , u = 1, . . . , k. 
At the end of this recursions we obtain fy(y) as q( T \y)'l, where 1 denotes a column 
vector of ones of suitable dimension. In implementing this recursion, attention must be 
payed to the case of large values of T because, as t increases, the probabilities in q®(y) 
could become negligible; see Scott (2002) for remedial measures. 

In the multivariate case, the same recursion as in (12) may be used, with m y substi- 
tuted by the vector m y with elements corresponding the conditional probability of the 
response vector y given every possible value of the corresponding latent state. For fur- 
ther details on this, and the following recursion, see Zucchini and MacDonald (2009) and 
Bartolucci et al. (2010). 

Posterior distribution of the latent variables 

Let q®(y) be the column vector with elements f Y (t+i), 2/ <+1 \ • • • , y^), u = 
1, . . . , k. This vector may computed by the backward recursion 

-(*)/ \ _ J 1) t = T, 

Q [y) -\ndmg(rn y(t+1) )q^(y), f = T-l,...,l. 

Then, the A;- dimensional column vector f^\y) with elements fuW\Y{ u \v)i u = 1, • • • , fc, 
is obtained as 

f l Hv) = 7 ^^[Q it) (y)]Q it) (v), t = l,...,T. (13) 
jY(y) 
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Moreover, the k x k matrix F®(y), with elements fu<t-v,u( t )\Y{ u i u \v) arranged by letting 
u run by row and u by column, is obtained as 

F {t \y) = -^ydiag[g(*- 1 )( ? /)]ndiag[m y(t) ]diag[qW(y)], (14) 

for t = 2, . . . , T. 



Appendix 2: derivative of the expected frequencies 

The derivatives of the expected frequencies in (9), (10), and (11) may be obtained by sub- 
stituting in (2), (3), and (4) every posterior probability with the corresponding derivative 
with respect to the parameters of interest. For instance, from (2) we have that the deriva- 
tive matrix 

da' 



90 

has the following elements 



99 j y 99 j 
for y — 1, . . . , c, where 9j is an arbitrary element of 0. 

In order to compute the derivative of f V (t) \ Y {u\y), and also that of fu(t-i),uW \yi w hh 
respect to every parameter 9j we can proceed as in Lystig and Hughes (2002) and Bar- 
tolucci (2006). In particular, let 

,«, %)= « and ^ (y) = m, ( 15 ) 

and let A (i) , and U U) be defined in a similar way as the derivatives of <f> u , A, and II 
with respect to in a similar way also define Finally, the vectors in (15) may be 

obtained by the following recursions: 



diag(mfj ) 1) )A + diag(m y( i))A 0) , t = 1, 

[ +dmg(m y(t) )n'q^\y), t = 2, . . . ,T, 
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and 



q {t \y) 



t = T, 

n^')diag(m ?/ ( t+1 ))q('+ 1 )(y) + ndiag(m$ +1) )^ +1 )(y) + 
{ Udi ag (m y(t+1) )q^\y), t — T — 



Finally, the first derivative of fy(y) with respect to 0j is obtained as fy-\y) — 
(q( T 'fi)'l. In a similar way, considering (13) and (14), we obtain the vector f^' j \y) and 
F^\y), having elements corresponding to the derivatives of fu(t)\y( u \y) an d ,[/(*) \y 

with respect to every parameter 9j 
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